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1. Introduction 



o 

(N 
(N 

. The Monte Carlo method was invented by John von Neumann in the Los Alamos Laboratory, 
in 1947, for solving the problem of neutron diffusion in fissionable material, and thus helping 
Edward Teller to build the hydrogen bomb (see Metropolis (1987)). 

The Monte Carlo method relies on independent samples from a probability distribution, to 
approximate an integral with respect to that distribution. Often, however, it is impossible or 
impractical to obtain such independent samples. One may still be able to construct a Markov 
chain with the target distribution as its stationary distribution. It is then possible to obtain a 
series of dependent samples by sampling from the Markov chain. This method is called Markov 
chain Monte Carlo (MCMC). 

Let Xi, X2, ■ ■ ■ , be a countable state, time homogeneous, ergodic Markov chain, taking values 
in Q, and having stationary distribution ir. Suppose that we are interested in computing K n f 
for some / : Q — > R. Then we usually make the approximation 

E f « ^*= f ° +1 /m (1 1) 

for some t > ("burn-in time"). For t fixed, and iV — > 00, this average converges to K n f by 
the ergodic theorem. However it is not clear how much this convergence is slowed down due to 
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the dependence of the samples. Consequently, an important question in practice is, how large 
should N be so that this approximation is correct to a certain level of precision? Practitioners 
often disregard this question, and keep silent about error bounds. 

It is well known that the average in (1.1) tends to converge faster for fast mixing chains 
than for slow mixing ones. But until now, there have been few practically applicable results 
that relate the error in (1.1) to the mixing time and the spectral gap of the chain (for one 
such result, see Lezaud (1998a)). Most of the results in the literature are based on asymptotic 
convergence of the average to normal distribution. As we will see from our simulation results, 
such asymptotic bounds may underestimate the error for finite sample sizes. We will briefly 
review some of these results in Section 3. 

Concentration inequalities can establish non- asymptotic error bounds for MCMC empirical 
averages of the form (1.1). For reversible chains, the speed of convergence is determined by 
the spectral gap when a sufficiently large burn-in time is chosen. In the non-reversible case the 
mixing time of the chain gives an upper bound on the speed of convergence. Paulin (2012a) 
establishes Hoeffding and Bernstein inequalities for both of these cases. In Section 6 of Lezaud 
(1998b), Bernstein inequalities are proven for Markov chains with general state space. 

The purpose of this paper is to present these inequalities in a simple way, and show their 
applicability on simulation results of various models. We first look at simulations on lattice 
models in statistical physics, where the spectral gap and the mixing time are known. Then 
we present a case study of Bayesian inference, where the mixing properties of the chain are 
unknown. We have found that our bounds compare favorably with the existing asymptotic 
results. 

1.1. Preliminary definitions 

We define the mixing time of a time homogeneous chain as in Section 4.5 and 4.6 of Levin, Peres and Wilmer 
(2009). Let Xi, X 2 , X 3 , ... be a countable state, time homogeneous, irreducible, aperiodic Markov 
chain with transition matrix P, state space Q, and stationary distribution ir. 
Let us denote 



d(t) := supdrv [P {x, •), 7r) , 
where drv is the total variation distance. The mixing time of the chain is then defined as 



<ndx(e) := min{t : d{t) < e} 



and 




If Q is finite, we can write the eigenvalues of the transition matrix P as 



1 = Ai > A 2 > ... > A| Q | > -1, 



and we denote 7 := 1 — A2 the spectral gap. 



The mixing time and the spectral gap are related by the following inequality. 

Lemma 1.1. For reversible, irreductible, aperiodic chains in discrete time with finite state space 
fl, denote 
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for which 

Proof. The first inequality follows from Theorem 12.4 of Levin, Peres and Wilmer (2009), for 
the second one, let e = 4~ n , then t m ; x (e) < nt mix , and n — > oo gives this bound. □ 

For a random vector with distribution n, there are many ways to define a Markov chain that 
has 7r as stationary distribution. 

Two of the most frequently used are the Metropolis-Hastings chain and the Gibbs sampler. 
Here we define the most frequently used variants of these (based on Chapter 3 of Levin, Peres and Wilmer 
(2009)). 

Definition (Metropolis-Hastings chain). Letfl be any finite set, and'fy an irreducible transition 
matrix. The Metropolis-Hastings chain modifies $ to obtain a chain with stationary distribution 

TV. 

The transition matrix of the Metropolis-Hastings chain for a probability distribution ir and 
symmetric transition matrix is defined as 

f *(*>y)™{i>5i§Sfe§} tfv**> (l4) 

Remark 1.1. In most of the practical situations ir(x) = h(x)/Z, with Z being a normalization 
constant that is difficult to determine. A very important feature of the Metropolis- Hastings chain 
is that the transition probabilities only depend on n through the ratio ^4, which is independent 
of Z . The same holds true for the conditional probabilities in the case of the Gibbs sampler. 

The Gibbs sampler is a special case of the Metropolis-Hastings chain, when one can directly 
sample from the conditional distribution of each of the variables given the rest. 

Definition (Gibbs sampler chain). Assume thatS is a finite set, V is a set of random variables, 
Q = S v , and let n be a distribution on Q. Then we define the Gibbs sampler chain as picking 
one variable in V uniformly at random, and resampling its value conditionally on the values on 
the rest of the variables. 

2. Results 

In this section, we present concentration inequalities that give non-asymptotic bounds on the 
approximation (1.1). We state Hoeffding and Bernstein inequalities for both reversible and 
non-reversible chains. 



2.1. Reversible chains 



For reversible chains with countable state spaces, the following version of Hoeffding inequality 
holds (Theorem 1 of Leon and Perron (2004), the current form follows Theorem 2.7 of Paulin 
(2012a)): 
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Theorem 2.1 (Hoeffding inequality for reversible Markov chains). Let X = (Xi, . . . , X^) 
be a time homogeneous, reversible, irreducible, aperiodic Markov chain taking values in some 
countable state space Q, with stationary distribution n. Let A2 be the second largest eigenvalue 
of P (X2 = 1 — t m ix the mixing time of the chain, and f : Q — > [a, b] . Let to > ( u burn-in 



time"), and denote Z :- 
and any t > 0, 



Si=tp f( X i) 

N-t 



, and let X' = max(0, A2). Then for any initial distribution 



F[Z>EJ + t],F[Z<Kf-t] 



(2.1) 



< 



exp 



1 - A' 



1 + A 



7 (N-t )t 2 /(b 



inf e 

0<e<l 



I tQ , 



Now we present a Bernstein-type result for countable state reversible chains (Corollary 2.10 
of Paulin (2012a), which is based on the proof of Theorem 1.1 of Lezaud (1998a), see also 
Lezaud (1998b)): 

Theorem 2.2 (Bernstein inequality for reversible Markov chains). Let X = (Xi, . . . , Xn) 
be a time homogeneous, reversible, irreducible, aperiodic Markov chain taking values in some 
countable state space Q, with stationary distribution n, spectral gap 7, and mixing time t m i x . 
Suppose that f : Q — > [—C,C], denote Vf := Var n (f) > 0, and let 



C := sup \f(x) - E w f\ < \E w f\ + C<2C. 
Let t > ("burn-in time"), and define Z := -^^3^ - , then for t > 0, 



P[Z>E ff / + t],P[^<E ff /-t] 

. 7/ 5 r (N-tv)^ 

< e 11 exp 

< e 7//5 exp 



4V f (l + h(5tC/V f ))) 
(N-t )t 2 7 



inf gL'mixt 6 ) 

0<e<l 



I tQ 

: L *mix( e 



with 



AV f + IOC ■ t 
1 



+ inf e 

0<e<l 



I *o 



h(x) := - (VTT^ - (1 - x/2)) . 



Define the asymptotic variance, a 2 , as 



a 2 := lim - Var n (f(X 1 ) + ... + f(X N )) 



then the following bound holds: 



(2.2) 



(2.3) 
(2.4) 

(2.5) 
(2.6) 



P \Z > E w f + t] , P \Z < f - t] < inf el d^IJ + e 

J 0<e<l 



7/5 



• exp 



/ 



-{N-to) 



(a 2 + key + Aa 2 K'tC - [a 2 + HC^j 



2a 2 K' 



t 



(2.7) 



with K' :-- 



wv f _ 5 
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Remark 2.1. inf < e <i eL'mixMJ can be further bounded by 4 LwJ . In many situations, Markov 
chains have a cutoff, which means that instead of geometrical decay, £ m i x (e)/£ m i x (l — e) — >■ 1 for 
any < e < 1/2 as the system size tends to infinity (see Figure 1 of Lubetzky and Sly (2009)). 
In such cases, choosing t to be slightly larger (or a few times larger) than t mix is sufficient. 



2.2. Non-reversible chains 

Most MCMC methods use reversible chains, in particular, the Metropolis-Hastings algorithm 
and the Gibbs sampler defined previously are reversible. However, using non-reversible chains 
can speed up the mixing time in some cases, for an example, see Diaconis, Holmes and Neal 
(2000). Therefore it is of interest to prove Hoeffding and Bernstein inequalities without assuming 
reversibility. 

For later use, we define the following quantities (see also Proposition 1.2 of Paulin (2012a)): 
C£ := inf t mix (e/2)/(l - e) 2 < 2.62t mix , (2.8) 

0<e<l 

CS' := inf t mix (6/2)/(l - V~ef < 4.43t mix , (2.9) 

0<e<l 

Vmin (to) := mf • < 4"^J . (4/3)t mix . (2.10) 

o<e<i 1 — e 

For Markov chains with a cutoff, t™£ ps t™™' ps t mix . 

The following two theorems are from Paulin (2012a) (Corollaries 2.4. and 2.9): 

Theorem 2.3 (Hoeffding inequality for Markov chains). First let X = (X%, . . . , X^) be a time 
homogeneous Markov chain taking values in some countable space Q. Suppose that f : Q — > [a,b]. 

Let to > ("burn-in time"), denote Z := — ^%^r^ — ■ Then for every t > 0, 



p ( z > em) + ^ = a) + 1), p (z < emi - ^ ~ a) -t) (2.n; 



N — t r v - N-t 



-2(N-t )t 2 \ ( -(N-t )f 

< eX P -71 X9.rr.in < eX P 



Remark 2.2. We give a short direct proof for this result in the Appendix. 

Theorem 2.4 (Bernstein inequality for Markov chains). Let X = (Xi, ...,Xn) be a homo- 
geneous Markov chain taking values in some countable space Q, with stationary distribution ir. 

Let f : Q — > [—C,C], let t > ("burn-in time"), and denote Z := — ^j^— ~ ■ Let C be 
defined as in (2.2). Denote 

N 

V := EC/W-EW) 2 . (2.12) 

i=t +l 

Then for every t > 0, 

p { z - EM + V n^T q + > p ( Z " EM - frf - *) (2 - 13) 

-t 2 (N-t ) 2 

< exp ' 



< 



exp 



t^{8V + AV2(N-t )C't) 

-t 2 (N-t f 
t mix (35.5V + 25.1(iV - t )C't) 
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Remark 2.3. This form of Theorem 2.2 and 2.4 follows from Corollary 2.11, and Corollary 
2.10 of Paulin (2012a) by the substitution /—>•/ — E^/. For another Bernstein inequality, 
which uses the pseudo spectral gap of the chain, instead of the mixing time, see Theorem 2.5 
of Paulin (2012a). 



2.3. Chains with general state space 

Lezaud (1998b) also proves Bernstein inequalities for chains on general state spaces. However, 
some quantities appearing in these formulas are hard to interpret in practice. We now state two 
such results and explain how they can be employed in practice. 

The following result is a Bernstein inequality for reversible chains on general state spaces, 
adapted from Lezaud (1998b), page 98 (here we have reformulated it to work for functions that 
are not necessarily centered, taking values in [—C,C]): 

Theorem 2.5 (Theorem 1.1, Chapter 6 of Lezaud (1998b)). Let P(x,dy) be a Markov kernel 
over a probability space (Q,J-,7r), with ir being the unique stationary distribution of P. Let 
Xi,X2,-- - be a Markov chain with this kernel. Let f : Q — >■ K, be a bounded function, with 
, < C and Vf := Var w (f) > 0. Let C be as in (2.2). For a distribution p on Q that has a 



density tp relative to ir, denote N p := (E^ 2 ) 1 ^ 2 . Suppose that the linear operator P defined by 
P(x,dy) extends to L 2 {ir), and it is self-adjoint. Denote the initial distribution of the chain by 
q, and let q{k) := P k q. 

Let to > ("burn-in time") and denote Z := ~^^z^~~ ~ ■ Then for every t > 0, 

F q [Z >E w f + t],F q [Z<E w f-t] 

(N - i )i 2 7 



<iV g(io) e^ 5 exp 
<iV g(io) e^ 5 exp 



4Vf(l + h(5tC'/V f )))\ 
(N - t )t 2 7 



(2.14) 
(2.15) 



AV f + 10C't_ 

where 7 is the spectral gap of the chain and h(x) is defined as in (2.5). 

In practice, we often do not know how to estimate N g . However, it is clear that for ergodic 
chains, q(t ) = P to q — > 71 as t — > 00, and thus N q (t ) — > 1 in most situations of practical 
interest. Therefore, for reversible chains, for an arbitrarily initial distribution, if we choose to 
sufficiently large, then roughly same bounds hold as in the countable state space case, see (2.4). 

The following result is a Bernstein inequality for non-reversible Markov chains in general 
state space. It is an improvement on Theorem 1.5 on page 99 of Lezaud (1998b), similarly as it 
is done in Theorem 2.8 of Paulin (2012a): 

Theorem 2.6. Let P(x, dy) be a Markov kernel over a probability space (Q, J 7 , 7r), with tt being 
the unique stationary distribution of P. Let Xi,X 2 , ... be a Markov chain with this kernel. Let 
f : Q — )• R be a bounded function, with \\f\\oo < C and Vf := Var n (f) > 0. Let C be as in 

1 /2 

(2.2). For a distribution p on Vt that has a density ip relative to ir, denote N p := (E^y? 2 ) 
Suppose that the linear operator P defined by P(x,dy) extends to L 2 (ir), and it is self-adjoint. 
Denote the initial distribution of the chain by q, and let q(k) := P k q. Let / ~f((P*) k P k ) be the 
spectral gap of the self-adjoint linear operator (p*) fc p fc ; and define the pseudo spectral gap as 

-y((P*) k P k ) 

7 ps :=sup^-^ '-. (2.16) 

fc>l K 
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Let to > ("burn-in time") and denote Z :- 



I2iLt +i f( x i) 

N-t 



Then for every t > 0, 



F q [Z>EJ + t},¥ g [Z<E n f-t] 

< N q{to) exp 

< N q{to) exp 



8V f (l + h{5tC'/V f )) 
(N - t )t 2 lps 



8V f + 20C't 



(2.17) 
(2.18) 



with h(x) defined as in the previous theorem. 

Remark 2.4. The proof of this is left to the reader as an exercise. 

Again, as in (2.5), N q (t ) — > 1 as to — > oo in most practical cases. It is important to note 
that on general state spaces, the mixing time based on the total variation distance can not be 
used. However, the pseudo spectral gap can be used as an alternative to the mixing time in the 
non-reversible case. 



2-4- Subsampling 

Z := — '"iv-fo — ~~~ ^ s n °t ^ ne om y possible way to approximate E^/. We may decide to only 
average in every mth step (typically, we choose m — I/7 for reversible chains, or m = t m ; x for 
non- reversible chains). Assume, without loss of generality, that 

iV = nm and to = t' m. (2-19) 

Denote X[ := X m , X' 2 := X 2m , ...,X' n := X n . m , and 



n-t' 



(2.20) 



Then X[, . . . ,X' n is a Markov chain, which is reversible if the original chain was reversible. 
In this case, choose m to be odd, then the new transition matrix P m will have second largest 
eigenvalue A m (where A denotes the second largest eigenvalue of P), and thus its spectral gap 
is 7' = 1 — (1 — 7) m . Let m 7 denote the smallest odd number greater or equal to I/7, then with 
the choice m := m 7 , we have 7' = 1 — (1 — r y) m > (this also holds in case 7 > 1). Similarly, 
for non-reversible chains, with the choice m = t m i x , X[, . . . , X' n will have mixing time 1. 

Therefore, with these choices, the reader can see that almost the same concentration inequal- 
ities hold for Z' as for Z (by applying our theorems on Z'). The advantage of this approach 
is that one only needs to compute / in every l/7-th (or t m i x -th) step, which may result in 
considerable savings if / is expensive to evaluate. 

3. Comparison with previous results 



In this section we give a brief review of some widely used MCMC convergence diagnostics and 
error estimation methods. For a more comprehensive overview of available techniques, we refer 
the reader to Cowles and Carlin (1996), Liu (2008) and Diaconis (2011). 
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3.1. Convergence diagnostics 

The most frequently used convergence analysis method, the Gelman-Rubin diagnostic, was 
introduced in Gelman and Rubin (1992) (this was further refined in Brooks and Gelman (1998), 
see also Gelman et al. (2004)). 

Gelman and Rubin (1992) propose a multiple sequence convergence assessment method: first, 
m parallel chains are run for 2n steps. Then the first n terms of each chain are thrown away. 
Finally, B and W (the between, and within sequence variations) are computed for each estimated 
function / of interest: 



m n m 

B ■= — 0— ~ Ef where E® := f l n > ^ : = Yl l m - (3- 1 ) 

3=1 »=i i=i 

W := — vU) where y{j) '■= (f ( X ^) ~ /(n - 1) (3.2) 

m j=i i=i 

From these, one computes the potential scale reduction, R, as a function of B and W. 



R := ^(n-l)/n)W + (l/n)B ^ 

Furthermore, one can estimate the effective number of independent samples, n e g, as 

((n - l)/n)W + {l/n)B 
n eS := mn^ U—!— , (3.4) 

This estimate corresponds to the aggregate number of "independent" samples from the m 
parallel runs. The chain is assumed to have reached convergence when R is sufficiently close 
to 1. The authors recommend R < 1.1 as a threshold adequate in most situations (for more 
details, see pages 294-298 of Gelman et al. (2004)). The main strength of this approach is that 
it is easy to implement, and is available in most statistical packages. However, it does not offer 
a quantitative bound on the error of the estimate - Y^=i /PQ ~ ^nf- 



3.2. Error estimation by the central limit theorem 

A central limit theorem (CLT) for Markov chains was introduced in Kipnis and Varadhan 
(1986): 

Theorem 3.1. Let (Xj)j>i be a stationary, irreducible, reversible Markov chain taking values in 
some general state space Q, with stationary distribution ir. Let f : Q — > R, Z n := - YH=i f(Xi), 
and j t := Cov 7r (f(X i ),f(X i+t )). Then 

oo 

nVar(Z n ) a 2 := j t almost surely. (3.5) 

t=— oo 



// cr 2 is finite, then 

^(Z n -E T f) ^N(0,a 2 ). (3.6) 
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Remark 3.1. This form of the theorem appears in Geyer (1992). For a more precise statement, 
see Theorem 17.0.1 of Meyn and Tweedie (2009). 

The conditions of the above theorem are satisfied by a large class of chains, for instance, both 
Gibbs and Metropolis-Hastings sampling (as defined in Section 1.1) are reversible. 

To make use of the limiting distribution N(0,a 2 ), Geyer (1992) proposes several estimators 
of a 2 . Firstly, the lagged autocovariance jt is estimated by the empirical autocovariance 

i n—t 

%,t = %,-t ■= ~ L/W - Z n\ lf(Xi+t) ~ Zn] ■ (3.7) 

i=l 

Define T n ^ m := 7 nj 2m + 7W,2m+i- The initial positive sequence estimator is defined as 

2m+l m 

°los,n ■= 7n,0 + 2 Y1 ln,i = ~%,0 + 2 ^ ^n.i, (3.8) 
i=l i=0 

where m is chosen to be the largest integer such that T n ^ > for 1 < % < m. The authors also 
define the initial monotone sequence estimator <5"^ onn by replacing Y nyi by minj<jr n j, and the 
initial convex sequence estimator d- 2 orw n by taking the greatest convex minorant. 

For all of these three estimators, Geyer (1992) proves that for almost all sample paths, 

liminf a 2 > a 2 , 

n— >oo 

i.e. the variance a 2 is asymptotically overestimated. Therefore asymptotically conservative 
confidence intervals can be obtained by using the quantiles of iV(0, a 2 ) with any of the three 
estimators. 

The main advantage of the above method is that it applies for general state space reversible 
chains, with any square integrable /. The disadvantage is that it is only proven to work asymp- 
totically, and we often do not know how well the Kipnis-Varadhan CLT approximates the 
normal distribution. Even in the independent case, by the Berry-Esseen bound, there is an 
error of order ^= in Kolmogorov distance. For Markov chains, this can be higher, especially 
when the mixing is slow. 

The advantage of our method is that it works non-asymptotically, and thus we can get more 
reliable error estimates when the mixing time and spectral gap of the chain can be bounded. 

3.3. Error bounds via Ricci curvature 

Ricci curvature for proving concentration of empirical averages of functions of Markov chains 
was introduced in Ollivier (2009), and further developed in Joulin and Ollivier (2010). Here, 
we will give a simple exposition of this method and compare it with our results. 

Let (Q, d) be a Polish space (metric, complete and separable) and define a Markov chain 
Xi, X 2 , . . . on this space with unique stationary distribution ir. We denote the associated tran- 
sition kernel (P x ) x& q so that P x is a probability measure on Q and P x (dy) is the transition 
probability from x to y. The iV-step transition kernel will be denoted by ■ The Ricci curva- 
ture is defined in terms of the Wasserstein distance. The Wasserstein distance of two measures, 
ui and /i 2 on (f2, d) is defined as: 

Wi(ji ltf i 2 ):= inf E a (l[X^Y\), (3.9) 
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ie. the infimum is taken over all couplings of \i\ and 

We say that a Markov chain has positive Ricci curvature k if it satisfies the following as- 
sumption: 

Assumption 3.1. There exists n > such that 

W 1 {P x ,P y )<{l-K)d{x,y), (3.10) 

for any x, y G Q. 

Define the eccentricity E at point x e Vt as 

E(x) :=E Y ^[d(x,Y)\. (3.11) 

Let a(x) 2 denote the coarse diffusion constant defined as 

1 
2 

where 7T; is an independent coupling between the random variables Y ~ P x , Z ~ P x . 
The local dimension n x ,x G Q and is defined as 

UX - /:^S|| L1P <1 KA\f(Y) ~ f(Z)\ 2 } ~ ' ^ ' 

where ||/|| L ip = sup x _^ (\f(x) - f(y)\/d(x, y)) . 
We define the granularity of the Markov chain as 

°~oo '■= 77 sup diam Supp P x , (3-14) 

2 x&Q 

where Supp refers to support. 

y^JV f(X) 

Theorem 3.2 (Theorem 4 of Joulin and Ollivier (2010)). Let Z = and de fi ne 



v) 2 :^-E ni [d{Y,Z) 2 ], (3.12) 



K{N-t ) V N-t J x€ q n x n 
then the following concentration inequality holds (P x and E x refers to initial point X\ = x) 

F x {Z > E X (Z) + t) , F x (Z < E X (Z) - t) (3.16) 
'2exp(-t 2 /(16V 2 ||/||2 ip )) ift>t^ 
2exp(- K (A^ - *o)*/(12*oo||/||Li P )) ift > t mSiX , 



< 



the boundary of the Gaussian window is t max := 4V 2 k(N — to) || /Ulip/ (3<Too) ■ The bias can be 
bounded as in Proposition 1 of Joulin and Ollivier (2010): 

\E X Z - E n Z\ < i— -L—E(x)\\f\\ Lip . (3.17) 

We later show how k can be bounded from below on spin lattice models, see Section 5. 

The above result depends on the distance d and ||/||Lip with respect to this distance. This can 
be advantageous for some general state space models but in discrete state space, its sensitivity 
may ruin the tightness of the bound. 

Our results employ well established quantities of the chain - the spectral gap in the reversible 
case and the mixing time in the non-reversible case - and depend only on the boundedness of /, 
as H/lloo- We also provide methods for estimating all parameters appearing in our inequalities, 
which makes them easier to apply in practice. 
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3- 4- Estimation of the mean square error 

Non-asymptotic results also exist for the mean square error of the MCMC estimate. Gener- 
ally speaking, these results work even for unbounded functions / that have a finite variance 
(with respect to tt). However, their main weakness is that one can only deduce quadratic decay 
through the Chebyshev inequality instead of exponential or Gaussian decay as with concentra- 
tion inequalities. 

We present a simple example of the Chebyshev inequality for finite state, reversible Markov 
chains: 

Theorem 3.3 (Theorem 12.19. of Levin, Peres and Wilmer (2009)). Let (X t ) t >i be a finite 
state, irreducible, aperiodic, reversible Markov chain, with stationary distribution n, and spectral 

gap 7 . If t > t mix (e/2) and N > ± ■ [4 Var n (f)/(v 2 ^)} + *o, denote Z := E, ^ _^° , then (for 
any initial distribution) 

P{|Z-E,/| > 77} < e. (3.18) 

Remark 3.2. Var n (f) can be estimated as in (4.1). 

Further non-asymptotic bounds have also been proposed based on mean square error. Rudolf 
(2011) gives a similar bound on the mean square error using the spectral gap 7 and ||/|| p for 
p > 2. Latuszynski, Miasojedow and Niemiro (2011) gives an asymptotically sharp bound on 
the mean square error as a function of the asymptotic variance a 2 . The bounds are valid on 
general state spaces without assuming reversibility. 

4. Estimation of parameters 

The main difficulty we encounter when applying Theorem 2.2 is that, in general, we do not 
know Vf = Var 7r (/) and a 2 (see (2.6)). Similarly, in Theorem 2.4, we usually do not know V 
(see (2.12)). In many cases, the spectral gap 7 and mixing time t m ; x are also unknown. 

4- 1. Estimation of the variance 

From the definitions, it is easy to see that we can estimate Vf as 



\ U i=t +l / \ U i=t +l / 



Similarly, we can estimate V as 

N 




(4.2) 



N-t 

i=*o+l 



Our next proposition gives a bound on the upper tails of Vf — Vf and V — V: 
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Proposition 4.1. Let ?7 mm (t ) be as in (2.10). Then we have, for any T > 0, 

r f Yt _ Pti ^ + T ) M M^ll (4 ,, 



" mix 



P (V - V > %4tjC + T) < 3exp ( 9(Jy :g; 4c ,„ ) . (4.4) 

Remark 4.1. In practice, in most cases, we will use Vf and V directly, and this proposition 
ensures that for sufficiently large N, the mistake is negligible. 

We will use the monotone sequence estimator of Geyer (1992) to estimate a 2 (see Section 3). 
We denote this estimate by a 2 := &monN-t - 

4-2. Estimation of the spectral gap and the mixing time 

Precise estimation of the spectral gap and the mixing time from the realizations f(Xi), . . . , 
f(Xjsr) is not possible, since it is a property of the Markov chain Xi, . . . ,X^ itself, and by 
applying the function /, we lose information. Nevertheless, in practice, we have found that the 
simple estimate in (4.6) works well. We now give a brief justification of this approach. 

For reversible chains with state space f2, transition matrix P, and stationary distribution ir, 
define / 2 (vr) as the Hilbert space of real valued functions on Q, with scalar product 

(f,g) ■= ^2f(x)g(x)ir(x). 

Let {(fii}i>i be an orthonormal basis made of the eigenvectors of P, corresponding to eigenvalues 
(^i)t>i- Then the largest eigenvalue Ai = 1, and (pi = 1, and obviously A2 = 1 — 7. 
By Proposition 1.5 on page 48 of Lezaud (1998b), we have 

- 2 = E</.^ 2 ^S^, (4.5) 

i>2 1 ' 

thus 7 < 2Vf/a 2 for any function /. With the choice / = ipi, we have a 2 = (2 — 7V7, and 
Vf — 1, thus 2Vf/o~ 2 = 7-2/(2 — 7), which is indeed very close to 7 (in practice, 7 < 1). 

For this reason, if we are only equipped with the values of a single function /: f(Xi), . . . , f(X N ), 
then we propose the estimate 

2V f 

7 := -zf- (4-6) 
o~ A 

In case we have the values f(Xi), . . . , f(X^) for several functions /1, /2, • • • , fk, then denote 
the corresponding estimates of a 2 and Vf by a 2 , ... , o\ and V/ x , . . . , Vf k . We estimate 7 as 

7 := min (4.7) 

i<i<k af 

A popular method for assessing convergence of the chain has been proposed by Gelman and Rubin 
(1992). We will make use of this method to indirectly estimate the mixing time. Having ob- 
tained an estimate of n e g with m parallel chains taking n steps (see Section 3), we can obtain 
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an estimate of the mixing time. Since n C R corresponds to the aggregate number of "indepen- 
dent" samples from m runs, n c ^jm is the average number of such samples per run. Following 
our argument on subsampling in Section 2.4, we propose the following estimate for the mixing 
time: 

nm 

:= • 4.8 



With 7 estimated as in (4.6) and t m j x as in (4.8), the Bernstein inequality for reversible chains 
becomes 



P[Z>E*/ + *],P[Z<E*/-*] 



< exp 



5a 2 



exp 



(N-t )t 2 



2a 2 + 5 c - 



v* 



(4.9) 



I Jo 

L *mb 



with C defined as in (2.2). Assuming that our estimate of £ mix in (4.8) is the correct order of 
magnitude, we can choose t large enough for the second term to become negligible. 



4-3. Advice to practitioners 

When applying MCMC methods in practice, the mixing time and spectral gap of the chain is 
often not known. Using estimates for the relevant parameters as described above, we recommend 
the following procedure (for reversible chains): 

1. Estimate t m i x using (4.8) (if too close to n, repeat with larger n). 

2. Run two parallel chains for kt mix iterations. We recommend using k = 1000 based on 
Proposition 4.1. Set to = 10t m j X . Estimate a 2 and Vf for both chains (using the monotone 
sequence estimator d' 2 aonN _ to , and (4.1)). If, for these 2 chains, the estimated values are 
far away, then increase the number of iterations. 

3. Use (4.9) with the estimated values of a and Vf to obtain the necessary number of steps 
for a given precision, and make the final simulation with this amount of steps. 



5. Simulations 



In the following, we present simulation results to demonstrate the applicability of the intro- 
duced error bounds. We are interested in the empirical tail probabilities of estimates, obtained 
from multiple runs of MCMC simulations. In particular, we will estimate logarithms of tail 
probabilities of the following form: 

log ( P ( Et V-l™ - E - / + t ))- (5 ' 1} 

We simulate m parallel chains and denote the sequence of states of the jth chain (1 < j < m) 
by Xi \ . . . , Xff . Then the empirical average obtained by the jth chain can be written as 

N f ( V U) 



- N - i ' < (") 
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and denote 

E:=-Ye {3) . (5.3) 

3=1 

Define the mean-shifted empirical distribution of these estimates as 

F(t) := - V 1[E® -E<tl (5.4) 
i=i 

and let 

flogfFft)) fort<0, and 

> " \ (5-5) 
log (l -F t ) fort>0, 



thus L(i) is an estimate of the log tails in (5.1). 
5.1. Lattice models in statistical physics 

We first consider simulations on the Curie- Weiss model and the Ising model (ID and 2D). These 
models and their variants are widely studied in the context of MCMC simulations, and for some 
special cases, the mixing time and spectral gap of the chain are known. 

5.1.1. Definition of models 

Let us assume that a := (<7i, . . . , a Hs ) are spins taking values 1 or —1, and distributed according 
to the probability distribution of the model, which is of the form (for some /3 > 0) 

P(o-i,...,<T n J = 1 , (5.6) 

where Hp^a) is the energy function, /3 is the inverse temperature, h corresponds to the external 
field, and Z = exp(H/3 th (a)) is the partition function. 

In the case of the Curie- Weiss model, we define the energy function as 



H Pth (v) = H™ {&)-.= ± a, a, ■ (5.7) 

X<i<.j<n s i=l 

In the case of Ising model on a graph G = (V,E), we say that i ~ j if there is an edge 
between i and j in G, and define H as 

Hp >h (a) = H^ h (a) := ^ o % a 3 +hJ2°i- (5-8) 

i~j i=l 

In the 1-dimensional Ising model, G consists of the edges for 1 < i < n s — 1, while in 

the 2 dimensional case, G consists of the edges on a square lattice. We use periodic boundary 
conditions so that each spin is connected with the same number of other spins. 

In practice it is impossible to obtain independent samples from (5.6). Therefore one typically 
designs a Markov chain which has (5.6) as its stationary distribution. In the following section 
we present analytic estimates of the properties of one such typically used Markov chain, the 
Glauber dynamics (a Gibbs sampler chain on the lattice model, as defined in Section 1.1). 
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5.1.2. Spectral gap and mixing time 

In the case of the Curie- Weiss model, at high temperature, with no external field, {(3 < 1 and 
h = 0), Theorem 1 of Ding, Lubetzky and Peres (2009) shows that for the Glauber dynamics, 
the spectral gap and mixing time are 



cw (l + o(l))(l-/3) cw l n s log((l-/3)X) 

7 = 1 , ^mix = o ; o +° 



n s 7 mix 2 1-/3 \1 -13 

so we will use the following approximate values: 

*cw ._ 1 -Z 3 lew ._ 1 n> log((l -ff) 2 w B ) , , 

7 - ^ , t mix .- 2 x _ ^ . ^.yj 

In the high temperature case ((3 < 1), but with h ^ 0, we can still apply Theorem 3 of 
Bubley et al. (1997) to show that the mixing time satisfies 

imix(e) < \n s log(n s /e)/ (1 - /?)] , and thus t mix < \n s log(4n a )/ (1 - /?)] , 

therefore we expect (5.9) to be good approximation in this case as well. Ding, Lubetzky and Peres 

(2009) also shows that for the critical (P = 1, h = 0) case, the mixing time is O ^n^ 2 j. For 

low temperature ((3 > 1, h = 0), the mixing time is exponential in n s . The inverse spectral gap, 
I/7, has the same order as the mixing time for both the critical and the low temperature case 
respectively. 

In the case of d dimensional Ising model with periodic boundaries (i.e. n d s spins in total), 
Lubetzky and Sly (2009) shows that for the continuous time Glauber dynamics (i.e. Glauber 
dynamics, with i.i.d rate-one Poisson clocks on each spin), the mixing time satisfies 

*m?x = -4«log(n s ) + 0(log(log(n s )), (5.10) 

27oi 

where 7^, Gi is the spectral gap of the Glauber dynamics on the d dimensional infinite lattice. 
The same result holds for the Metropolis algorithm, with 7^ G/ replaced by ■y^ Metr °p° hs _ 

The spectral gap of the finite model satisfies, by Theorem 4 of Lubetzky and Sly (2009): 
under some weak conditions (strong spatial mixing) on (3 and h, |7^ G/ — 7^, G '| < n s l / 2+o( - l \ 

For the 1-dimensional case with h = 0, strong spatial mixing holds, and 7^ G ' = 7 1,G/ (n s ) = 
1 — tanh(2/3) (independently of n s ). This means that for h — 0, 

= -Talog(n s ) + 0(log(log(n s )) 

27oi 

log(n a ) + 0(log(log(n fl )). 



2(1 - tanh(2/3)) 

The above results hold for continuous time Glauber dynamics. However, we use discrete time 
Glauber dynamics, and thus adopt a modified version of these results. Since in the continuous 
case, there are in total n d s rate-one Poisson clocks, in the discrete case, it is natural to expect 
a slowdown of n d s times in the spectral gap, and mixing time. Therefore, we write 



(5.11) 
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In the 2-dimensional case, although (5.10) still applies for the continuous time Glauber dy- 
namics, the explicit form of 7^ G ' as a function of /3 and h is not known. Therefore, we will use 
a modified version of (5.9): 

~2,gi _ 1 ~ g/fl, p,Gi _ l nglog((l-/3//3 c ) 2 ^) 

7 : " nl ' *"•*»- 2 l-/3//3 c ' 1 j 

where /3 C = |log(l + \f2) is the critical inverse-temperature. 

5.1.3. Simulation results for total magnetization 
We will be interested in the total magnetization, i.e. 

f(a) = m(a) := 

i=i 

In the case of no outside field (h = 0), we have, by symmetry, Em(cr) = 0, for all of our 
models. 

For some fixed /3,h, and random initial distribution (uniformly chosen in a), we run m 
Markov chains. The simulation results are shown in Figure 1. We observe that in our examples, 
the Bernstein- inequality, based on (4.9), provides a tight upper bound on the tail probabilities. 
In contrast, the normal quantiles based on the monotone sequence estimator of a 2 commonly 
underestimate the number of samples needed for a certain estimate precision. The Hoeffding- 
inequality does not incorporate the variance of /, and thus gives a weak bound on the tail 
probabilities. 

For the Ricci curvature method, we plot (3.16). We choose the distance d(x,y) as the Ham- 
ming distance, i.e. d(x,y) = J2i< n ^ vH- Then H/Hup = 2. In the case of the Ising models, 
Example 17 of Ollivier (2009) shows that 

1 / e /3 -e~ /3 \ 
^-{l-v^^^j, (5.13) 

with v max being the maximum number of neighbors in the graph {v^ x = 2, = 4). For the 
Curie- Weiss model, because (5.7) contains l/n s in the sum, the bound becomes 

Kcff >-(l-K-l) eWn , +e _ g/ „J . (5.14) 

Furthermore, in every case, we have n x > 1, < n s , and a 2 (x) < 2, E(x) < n s . The bias 
term (3.17) is negligibly small with our parameters. 
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(c) Curie- Weiss model, Glauber dynam- 
ics, low temperature. Lattice size 10, 10 6 
runs, N = 10 5 , to = 11262, T = 1//3 = 0.50, 
h = 0, 7 dat = 9.66 ■ 10~ 4 , a 2 = 1.75 ■ 10 5 , 
Wx.dat = 1.13 • 10 3 , C = 10 



(d) 1-D Ising model, Glauber dynamics. 

Lattice size 100, 10 6 runs, N = 10 5 , to = 5260, 
T = 1//3 = 2.00, h = 0, 7 1,G1 = 2.38 ■ 10~ 3 , 
-3 i2_, o5.ln5 xl.Gl 



"'fdat 



2.79-10- 



10 2 , t mix ,dat = 5.26 



1.93- 10 5 
10 2 , C = 100 




■ Simulation 

■ Hoeffding 

■ Bernstein 

■ Bernstein estimated 
1 CLT 

Ricci curvature 



(e) 2-D Ising model, Glauber dynamics. 



7250 



T=l//3 = 
= 3.18-10- 



5.00, ft = 0,7 
3 , ct 2 = 1.78-10 



2,Gi = 5.46-10- 3 , 

f2,Gl 



/ 



3.11- 



10 2 , t mix , da t = 7.25 ■ 10 2 , C = 100 



Fig 1: Simulation results for lattice models. The simulation result is plotted according to (5.5). When 
formulas arc available for the mixing time and the spectral gap ((a),(c) and (e), see Section 5.1.2), the Hoeffding 
bound and Bernstein bound are plotted according to (2.1) and (2.7) respectively. We use estimated values of 
the parameters 7dat, imix,dat, <7 2 and Vf (see Section 4), and plot the estimated Bernstein bound according to 
(4.9). We also show the quantiles of N(0,a 2 ) arising from the CLT (a 2 is estimated as in Geyer (1992), see 
Section 3). The bound based on Ricci curvature method is plotted according to (3.16). 
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5.2. Bayesian model averaging 

In this section we look at simulation results on the space of directed acyclic graphs (DAGs) for 
Bayesian model averaging. 

5.2.1. Definition of the model 

DAGs are commonly used to encode the factored representation of a high- dimensional joint 
probability distribution. Let us consider a graph G = (X,E), where X = (Xi, X 2 , . . . , X n ), 
is a set of vertices, each representing a random variable, and E is a set of directed edges 
between these variables. For a node Xi G X, we denote the set of its parents as Pa(Xi), where 
Xj G Pa(Xi) if and only if (Xj,Xi) G E. The non-descendents of a node Nd(Xi) consist of 
the nodes to which there is no directed path from X^. The DAG structure entails conditional 
independence relations among the variables of the following form: 

Xi AL Nd(Xi) | Pa(Xi), l<i<n (5.15) 

With these independence assumptions, the joint distribution of the variables factors as 

n 

¥(X 1 ,X 2 , ,X n ) = n F{Xi\Pa{Xi)) (5.16) 

i=i 

Now, given a set of observations D, we attempt to make predictions about a function / of 
the model structure G. D consists of vectors of realizations of (X\, X 2 , . . . , X n ). One could find 
a single best DAG structure with respect to D, and use it to calculate the value of /. Instead, 
we follow a Bayesian model averaging approach, where we calculate the posterior probability 
of each possible DAG structure and use it as a weight when making the prediction. We refer 
to Neapolitan (2004) for a more detailed account of this approach. The prediction E [/((?) |.D] 
can be expressed as a weighted average of individual predictions based on each possible DAG 
structure g: 

E [f(G)\D] = f(9MG = g\D), (5.17) 

where Q n is the set of all DAG structures on n variables. 

The model G is parametrized using a set of conditional probability tables describing the 
probability of each node taking a certain value given its parents. We denote the set of all such 
parameters 6 G . 

The posterior probability of a structure G can be obtained by applying Bayes theorem on its 
marginal likelihood. The marginal likelihood is generally expressed as 

F(D\G)= [ F(D\0 G ,G)¥(9 G )d9 G . (5.18) 

In our current example, we assume that each random variable is binary, that is, Xi G {0, 1}. 
As is typically done in the context of binary DAG models, we set a beta distribution as the 
prior distribution of each variable conditioned on its parent configuration. 
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Using beta priors, Heckerman, Geiger and Chickering (1995) shows that the marginal likeli- 
hood can be calculated as 

where % refers to a node X^, j is a value configuration of the parents of node Xi, with qi the 
total number of parent value configurations, k indicates the value of node Xi under parent 
configuration j, and is the number of different values that Xi can take. For each combination 
of indices, d^ and represent the observed count, while s^- and are the prior counts. To 
make priors consistent among different DAG structures, we choose a fix equivalent sample size 
S, and set s^k = — ■ 

For simplicity, we assume that the prior probability for each structure is equal, that is, 
VGGfi n ,P(G) = ^. 

As the number of summation terms in (5.17) can be prohibitively large to compute exactly, 
we design a Markov chain with stationary distribution ¥(G\D) and use a Monte Carlo estimate 
to approximate the prediction. 



5.2.2. Procedure 



We follow Madigan, York and Allard (1995), and design a Markov chain on Q n with stationary 
distribution F(G\D). Starting with an initial DAG structure, the chain either adds or removes 
a single edge at each proposal step. We denote the neighborhood of a state Gi as Nb(Gi), which 
is the set of DAGs that differ from Gi by one edge addition or one edge removal. The chain 
then uses the following probabilities to propose the next state: 

Gj e Nb(Gi 



T(GAGi) = { W G *)I' WJ ----v-v (5 _ 20) 
' 0, Gj £ Nb(Gi) V ; 

When making the proposal, we make sure that only valid (cycle-free) DAGs are considered. 
The chain moves to the proposed state with the following acceptance probability: 

A{G 3 \G{) - mm . (5.21) 

Note: The ratio of marginal likelihoods can be evaluated locally at the target node of the 
single edge that is changed during the proposal step. As opposed to some of the lattice models 
discussed in the previous section, here, no analytic formulas are known for the mixing time and 
spectral gap of the Markov chain. 



5.2.3. Simulation results 



In the following simulation example, we have a set of n = 6 variables, thus the space of the 
Markov Chain consist of DAGs with 6 vertices. We take a data set D consisting of 20 vectors 
generated from a known DAG on 6 nodes (structure not shown), and assume a prior equivalent 
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-0.3 -0.2 -0.1 0.1 0.2 0.3 0.4 0.5 

f 

(a) Graph size 6, 10 5 runs, N = 2 ■ 10 4 , t = 
1958, % at = 8.35 ■ 10~ 3 , a 2 = 5.84 ■ 10 1 , 
imix,dat = 1.96-10 2 ,C = l 

Simulation 

Bernstein estimated 

■ - - ■ Geyer 



-0.8 -0.6 -0.4 -0.2 0.2 

t 

(b) Graph size 6, 10 5 runs, N = 2 ■ 10 4 , t = 
8773, 7 dat = 8.18 • KT 7 , & 2 = 3.27 • 10 1 , 
tmix,dat = 8.77- 10 2 , C = 1 



Fig 2: Simulation results for Bayesian model averaging. The simulation result is plotted according 
to (5.5). We use estimated values of the parameters 7d a t, tmix.dat, °~ 2 and V; (see Section 4), and plot the 
estimated Bernstein-bound according to (4.9). We also show the quantilcs of N(0,6- 2 ), arising from the CLT 
(<7 2 is estimated as in Geyer (1992), see Section 3). 



sample size of 4. Our goal is to estimate the posterior probability of an edge being present in 
the structure: 

We look at two cases, first, at the presence of the edge e a = (i — l,j = 2), and then at 
e& = {i = l,j — 4). The simulation results are shown in Figure 2 (a) and (b) respectively. These 
figures show examples of exponential tails, for which our proposed Bernstein bound provides a 
tight upper bound. The normal quantile based estimate is poor on the side with exponential 
tail. 



6. Final remarks 



In order to get rigorous, sharp error bounds for empirical averages in MCMC, one needs to know 
the mixing time of the chain (for setting the "burn-in time" to sufficiently large), the spectral 
gap (for reversible chains), and the concentration properties of the function / at the stationary 
distribution. The Hoeffding inequalities only use the lower and upper bounds on /, while Bern- 
stein inequalities take into account the variance of / as well. Our simulation results show that 
this distinction is important for obtaining tight error bounds. While the normal approximation 
by the central limit theorem can only handle Gaussian tails, concentration inequalities are also 
applicable in case of exponential tails that arise in practice. 

It would be interesting to get even sharper results, under additional conditions on /. For 
instance if / has Gaussian or exponential tails (such tail inequalities are proven for statistical 
physical systems satisfying the Dobrushin condition in Paulin (2012b)), one could get sharper 
error bounds, since it is the typical deviation of / that really matters and not its maximal 
range. 
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For further practical examples where the bounds we have presented can be used, we refer 
the reader to Gilks, Richardson and Spiegelhalter (1995), Liu (2008) and Landau and Binder 
(2009). 
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7. Appendix 

Proof of Theorem 2.3. Marton (1996) proves measure concentration in Hamming distance for 
countable state Markov chains. For a homogeneous, ergodic Markov chain with state space fl, 
and transition probabilities Pjj, let us denote 

q := max dxv (Pi,-, Pj,-)- (7-1) 

Then Proposition 1 of Marton (1996) proves that measure concentration holds with con- 
stants 1/(1 — q) 2 times worse than in the independent case (see also Kontorovich (2007) and 
Chazottes et al. (2007)). In particular, Mcdiarmid's bounded differences inequality holds with 
1/(1 — q) 2 times weaker constant than in the independent case: 

Proposition 7.1. Suppose that g : Q n — > R is C-Hamming Lipschitz (i.e. g(x) can change at 
most by C if we change only one coordinate in x ), and let X = (Xi, . . . , X n ) be a homogeneous, 
ergodic Markov chain taking values in Q, then for every X, 

\ 2 Cn 

logEexp(A(^(X) -Eg(X))) < ^73^' ( 7 - 2 ) 

and thus 

P (g(X) > Eg(X) +t),P (g(X) < Eg(X) - t) < exp (- 2(1 , (7.3) 

Fix some < e < 1/2. Suppose, without loss of generality, that N is divisible by t m ix(e), and 
let n = iV/t mix (e). Divide X\, . . . , X^ into t m ix(e) groups such that the indexes of the elements 
in these groups are at least t m i x (e) distance from each other: 

y (1) ■= (Y^, F«) := (X u X 1+tmiAe) , . . . , X 1+(n _ 1)imix(e) ) , 

y(n) ;= ^ W«) > _ jrWW) ;= (X Wx(e) , X 2W(£) , . . . , X ntmh[{e) ) . 

Now we use a trick from the proof of Theorem 1 of Janson (2004). Denote 

N tmix(e) n 

i=l j=l i=l 

then by Jensen's inequality, 

1 



E(exp(XW))< 

Wx(e) / / 

■ E ex P ^«(e) ■ [/(^-(^(i-il+j) - E /( x WxW(i-i)+i)] 
j=i \ \ i=i 

Now we notice that {X tmix ( e ){i-i)+j}i<i<n is a Markov chain by itself, and it is easy to see that 
for this chain, q < 2e, thus we can apply (7.2), with C = b — a: 



E (exp (AH/)) < exp 



X 2 n(b — a) -t T 
8(1 -2e) 2 
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and thus, by Markov's inequality, 

-2t 2 (l -2e) 2 
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p fe f ^ ^ E E/ ™ + M ^ exp (- 



,iV(6-a) 2 t mix (e) 

To get (2.11), we only need to rescale this, change N to N — t , optimize in e, and show that 

r](t ){b - a) 



1 N 



N — t 



i=t +l 



< 



N-t 



o 



these are left to the reader. 
Proof of Proposition J,..!. We have 



V, = E w f 2 - (E n f) 2 , and 



A' 



Vf 



N-t 



n . 



i=to+l 



N — t 







Define 



iV 



" E /W> and 



A := 



N — t 



- (E^./) , then 



Vf-V f = D 1 + D 2 . 
The upper tail of D\ can be bounded by Theorem 2.3: 

Ppl > VmMC 2 +t)< exp 

Now we bound the upper tail of D 2 : 
Do -- 



2t 2 (N-t c 



CH 



mm 
mix 



Si=t +1 f( X i 



N-t 



o 



Y2i=tn + 1 f( X i 



H=t +1 



< 



< 2C 



N — 


to 




f(*i) 


N — 


to 


2-a=t +l 


f{Xi) 



N-t 



o 



2~2i=t +i fjXjj 
N-t 



Y.iLtp+1 f( x i) 

N-t 

,Si=tn+l f(Xi 



N-t 



E./ 



+ E,/ 



Eilin + l f( X i) 







N-t 







,Si=i +l f( X i 



N-t 



o 







□ 



(7.4) 
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therefore Theorem 2.3 gives 



P( J D 2 >4r /min (t )C ,2 + t) <2exp 

Combining (7.4) (for t/3) and (7.5) (for 2£/3), we get 

P(Di + D 2 > 5 Vmm (to)C 2 + t)< exp 
2(2/3t) 2 (N-t ) 



2 exp 



ACHl 



2t 2 (N-t ) 
4C ,4 Ci" 



2(t/3) 2 (iV-t ; 

''mix 



so (4.3) follows. The proof of (4.4) is similar. 



